-
-
Notifications
You must be signed in to change notification settings - Fork 31.8k
gh-127266: avoid data races when updating type slots v2 #133177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
In the free-threaded build, avoid data races caused by updating type slots or type flags after the type was initially created. For those (typically rare) cases, use the stop-the-world mechanism. Remove the use of atomics when reading or writing type flags. The use of atomics is not sufficient to avoid races (since flags are sometimes read without a lock and without atomics) and are no longer required.
To avoid deadlocks while the world is stopped, we need to avoid calling APIs like _PyObject_HashFast() and _PyDict_GetItemRef_KnownHash(). Collect the slot updates to be done and then apply them all at once. This reduces the amount of code running in the stop-the-world condition.
🤖 New build scheduled with the buildbot fleet by @nascheme for commit d511ca6 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again. |
Now that stop-the-world is used to do the slot update, these tests are a lot slower in the free-threaded build. Test with fewer items, which will still hopefully be enough to find bugs in the specializer.
The clearing of Py_TPFLAGS_HAVE_VECTORCALL must be done when the world is stopped too.
🤖 New build scheduled with the buildbot fleet by @nascheme for commit 3cb2256 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again. |
Since we stack allocate one chunk, we need to check 'n' to see if there are actually any updates to make. It's pretty common that no updates are actually needed.
@@ -576,6 +576,7 @@ class TestRacesDoNotCrash(TestBase): | |||
# Careful with these. Bigger numbers have a higher chance of catching bugs, | |||
# but you can also burn through a *ton* of type/dict/function versions: | |||
ITEMS = 1000 | |||
SMALL_ITEMS = 100 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth investigating this further. I'm surprised there's a large slowdown in this PR, but not in the earlier version. What's the relevant difference?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I timed the most recent version of this PR vs "main", using SMALL_ITEMS = 1000
. It seems the slowdown is not too bad now. I've left it at 100 but we could revert to 1000 if we want to.
I wasn't able to exactly pin-point the reason for the huge slowdown before. Based on "samply" profiling, it looked like there was a lot of contention for the TYPE_LOCK
mutex and the STM mutex. Quite a lot of time was spent in _Py_yield()
. That's kind of expected since the two writers are both trying to acquire both of those.
@colesbury Not sure if you would like this approach but it is a variation on your idea to modify the critical section to prevent release of the mutex. I changed it to work with a PyCriticalSection2 as well. https://github.com/nascheme/cpython/tree/type-slot-ts-release-hack Not using a critical section for the type dict looks kind of difficult to do. For example, |
The |
See GH-133467 for some remaining data race issues. |
If the two mutex form of the critical section is used, need to put the other mutex into '_cs_mutex'.
🤖 New build scheduled with the buildbot fleet by @nascheme for commit 3f6222b 🤖 Results will be shown at: https://buildbot.python.org/all/#/grid?branch=refs%2Fpull%2F133177%2Fmerge If you want to schedule another build, you need to add the 🔨 test-with-buildbots label again. |
This is an updated version of GH-131174, which was reverted. I figured the cleanest thing to do is make a new PR.
This is the same as the previous PR with the following additional change. The
update_all_slots()
andtype_setattro()
functions are now more careful when the world is stopped. Instead of doing the MRO lookups while the world is stopped, we do them all first and collect the slot pointers to be updated. Then, we stop the world and do those updates. This makes it much easier to confirm the code running during the stop-the-world is safe and that should avoid the deadlocks.The
test_opcache
test has become quite a bit slower. It seems to be due to mutex contention in the__getitem__
and__getattribute__
method assignment tests. I reduced the items count from 1000 to 100 to keep the test from becoming much slower.